I. The Business Problem & Our Insights

1. What Is the Case?

We are consulting for a real estate company that has a niche in rental property investment for short-term gains as part of their business model, specifically within New York City. The real estate company has already concluded that two bedroom properties are the most profitable. Now, they need to find out which zip codes are the best to invest in in terms of generating the most profit on short term retals within New York City.

The primary data sources from both Zillow and AirBnB are public available. For the cost data, Zillow provides us an estimate of value for two-bedroom properties. For the revenue data, AirBnB is the medium through which the investor plans to lease out their investment property.

Our objective is here is to find a way to help this real estate company to find out the zipcodes that can generating the most short-term return from the two bedrooms properties they invest. And the main approach we use will be centered around the exploratory analysis along with business interpretation.

2. What Is Success?

Explicitly, we define the success of this analytics project as providing top zipcodes in New York City meeting the client’s expectations. In addtion to answering this specific business question, we should also clarify our exploratory logic and conduct the analysis in a scalable manner so it can be applied with either new data or new scenarios in the future.

To achieve this goal, our analysis should facilitate the company to dive deep in the collected data for hidden insights by primarily focusing on:

1) deploy sophisticated data treatment to filter out those useful information.
2) interpret the analysis result accordingly to decide the zipcodes with high profitability.

3. What Are the Assumptions?

Initially, there are four main assumptions provided by the client:
1. The occupancy rate is 75%.
2. All investment will be paid in cash at once with no mortgage, thus no concerns on interest rate.
3. The time value of money discount rate is 0%.
4. All properties and all square feet within each locale can be assumed to be homogeneous.

As the anlysis goes, more assumptions were made by us in order to simplify the procedure without critical influence on the integrity of our analysis result. Those assumptions are mentioned in the following context respectively and also summarized as followed:

1. When calculate the Cap Rate, instead of net operating income, we use annual revenue without concerns on expense.
2. When using zillow dataset to predict the current cost of properties in July 2019 (latest data in airbnb dataset), we assume there is seasonality in the price and also that values depend not only on previous values (Auto Regressive AR) but also on diiferences between previous values (Moving Average MA).
3. About the price of those property that rent out the space as private rooms, we assume that both rooms are booked at the same time. Thus, the occupancy rate is same as regular entire apartment.
4. When calculate revenue, we assume cleaning fee is not considered.
5. In addition to availability, we intuitively assume that more reviews and longer listeing time equate to popular properties and popular neighborhood/zipcode
6. No concern on property appreciation and other financial factors.

II. Data Processing

In this section, we have four parts:
1) Initial setups, including loading R libraries and reading datasets
2) Data treatment on airbnb dataset, including cleaning, munging, and filtering
3) Data treatment on zillow dataset, including cleaning, munging, and filtering
4) Data join, along with some some simple feature engineering

1. Initial Set-up

1.1. Install and Load Packages

In this part, to simplify the process of installing, updating, and loading all necessary R packages, we used a special open source package lubripack. This package can be used to easily load multiple pacakges at once and automatically install and update them when necessary.

## 
## Bellow Packages Successfully Installed:
## 
##   data.table        dplyr      ggplot2        tidyr       naniar 
##         TRUE         TRUE         TRUE         TRUE         TRUE 
##   colorspace     forecast        astsa       Amelia       scales 
##         TRUE         TRUE         TRUE         TRUE         TRUE 
##       GGally       Matrix       plotly         maps   kableExtra 
##         TRUE         TRUE         TRUE         TRUE         TRUE 
##  matrixStats         mice        caret DataExplorer     processx 
##         TRUE         TRUE         TRUE         TRUE         TRUE

1.2. Load Datasets

  1. AirBnB data: (Downloaded from http://data.insideairbnb.com/united-states/ny/new-york-city/2019-07-08/data/listings.csv.gz) This is the dataset that will provide revenue infomation for rental properties listed in the New York area. It was last uplated in July 2019.

  2. Zillow: (Provided seperately by the client) This ZIllow Home Value INdex(ZHVI) file is the time series dataset that will provide valuation for 2 bedrooms properties across multiple states, including New York, in a wide range of time period from the 90s to June 2017.

1.2.1. Datasets Inspection

First, we should have a basic idea about the dimension of those two datasets.

## [1] 48895   106
## [1] 8946  262

The Airbnb listings data contains 48895 rows and 106 columns, while the Zillow dataset contains 8946 rows and 262 columns.

See Appendix 1 and 2 for the complete list for column names and summary of missing values for both datasets

2. Data Treatment for airbnb Dataset

2.1. Data Cleaning – airbnb Dataset

First of all, we will conduct some data cleaning procedures seperately before we proceed to extract only the data for 2 bedrooms properties in NYC. Then we will generate a user defined function to help filtering out useful and qualified columns. The entire cleaning procedure can also be used in the future when newer airbnb data become avaiable.

2.1.1. Missing Values

As the result indicated (see appendix 1), the majority of description columns and host related information are blank. Steps are taken in the later section to filter columns with higher percentage of Nulls/NA. We will focusing on those columns that are critical for our analysis.

First of all, unfortunately, zipcode column contains 488 missing values. Ignoring these values can partially compromise the intergrity of the analysis. Zipcodes are imputed by selecting a non-NA value from Neighbourhood Group Cleansed.

## [1] 0

Total Number of NA values in Zipcode Column after Imputation is 0.

The rest of

2.1.3. Negative or Zero Valued Columns

None of the price columns: Price, weekly price & Monthly price should have zero or negative value.

2.1.4. Columns that need more detailed inspection: bedrooms, beds.

## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
##  4569 34985  6497  2069   525   147    42    21     8     5     3     1 
##    14 
##     1
## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
##  1094 31104 10392  3635  1508   571   287    89    67    33    19    17 
##    12    13    14    15    16    17    21    22    26    40 
##    15     8     2     4     4     1     2     1     1     1

There are 0 or NA input in these three columns, but no negative input was found. However, it is unusual that there are 4569 properties that have no bedroom while there are 1094 properties even have no bed. But it is possible that those property are having other kind of beds, especailly given that there is no missing value in the bed_type column.

## 
##        Airbed         Couch         Futon Pull-out Sofa      Real Bed 
##           170            73           295           261         48096

This result confirms our assumption. Thus we don’t need to pay extra attention to the number of bedrooms at this point.
Also, there are 11 inputs in the price column that has a non-positive value. Since the number is fairly small, we can drop them without affecting the integrity of the analysis.

2.2. Data Filtering – airbnb

First of all, we filter out those records that are within our business scope to simplify the process and save computational power. In this case, it will be those 2 bedrooms properties in New York City.

Then, in order to remove columns that add little or no value to the analysis in a scalable way, some of the smart data munging techniques are incorporated. These include removing: 1) columns based on pattern matching with their names, 2) imbalanced columns, and 3) character columns with 100 % variance, etc.

For the main part in data filtering, we use associated methods to remove these columns from airbnb dataset and store the procedures as a user defined function (UDF) for future use. After that, we mannully exclude columns that contribute no significant value for further analysis.

2.2.1. Business Scope - New York City & 2 Bedrooms

Before proceeding to the next step, keep only those records that are from the New York City and has 2 bedrooms in the property given the business context. However, as it was noticed that there are many variations of NYC in the city column that requires much effort to eliminate, we can simply use state at this point for a preliminary procedure.

## [1] NY       Ny       ny       MP       New York CA       NJ               
## [9] <NA>    
## Levels:  CA MP NJ NY New York Ny ny

Although the airbnb dataset per se is generated with the criteria of listing in New York City, it has many variations in the city column and even include cities from other state.

state is chosen here for filtering rather than city because it is less likely to have typos in the name. Also, since we are going to merge Airbnb data and Zillow data, a preliminary filtering here will be sufficient to reduce the computational cost. More work can be done if needed as we proceed. The steps are listed here as reference for other applications in the future.

Now, the airbnb dataset includes only those records that are within our business scope of 2 bedrooms rental properties in New York City.

2.2.2. Scalable Data Munging

We creat a cleanAibnb function here as a scalable solution to removes columns that will not be very useful in our case of analyze zipcode profitability. For reference purpose, all dropped columns are also listed. See Appendix 3 for details.

1. Zero Variance Cols

Imbalanced/Zero Variance columns add no value to the analysis. These columns are removed using nearzeroVar method from Caret package.

2. Pattern Matching

Column names starting with “require”, “host”, “calendar” and ending with “url” and “nights” are irrelevant information when the property is invested in, by the real estate company.

3. Based on NA Values

Columns with over 60% NA values are removed without mercy.

The clean Airbnb Function

Call the cleanAirbnb function and see what are those columns that have been removed.

## [1] "------ removed due to Zero Variance------"
##  [1] "scrape_id"                                  
##  [2] "experiences_offered"                        
##  [3] "thumbnail_url"                              
##  [4] "medium_url"                                 
##  [5] "xl_picture_url"                             
##  [6] "host_acceptance_rate"                       
##  [7] "host_has_profile_pic"                       
##  [8] "state"                                      
##  [9] "market"                                     
## [10] "country_code"                               
## [11] "country"                                    
## [12] "bedrooms"                                   
## [13] "bed_type"                                   
## [14] "has_availability"                           
## [15] "requires_license"                           
## [16] "license"                                    
## [17] "jurisdiction_names"                         
## [18] "is_business_travel_ready"                   
## [19] "require_guest_profile_picture"              
## [20] "require_guest_phone_verification"           
## [21] "calculated_host_listings_count_shared_rooms"
## [1] "--------------------------------------"
## [1] "------Columns removed due to defined patterns------"
##  [1] "host_id"                   "host_url"                 
##  [3] "host_name"                 "host_since"               
##  [5] "host_location"             "host_about"               
##  [7] "host_response_time"        "host_response_rate"       
##  [9] "host_is_superhost"         "host_thumbnail_url"       
## [11] "host_picture_url"          "host_neighbourhood"       
## [13] "host_listings_count"       "host_total_listings_count"
## [15] "host_verifications"        "host_identity_verified"   
## [17] "calendar_updated"          "calendar_last_scraped"    
## [19] "listing_url"               "picture_url"              
## [21] "minimum_nights"            "maximum_nights"           
## [23] "minimum_minimum_nights"    "maximum_minimum_nights"   
## [25] "minimum_maximum_nights"    "maximum_maximum_nights"   
## [1] "--------------------------------------"
## [1] "------Columns removed due to more than 60% NA values------"
## [1] "square_feet"   "weekly_price"  "monthly_price"
## [1] "--------------------------------------"

One unexpected column that has been removed is the square_feet column with 48487 NA values, which is approximately 99.2% of the total number of records. Since this information is usually important for real estate investment, some external data can be included if necessary. For this case, we will just leave it out of our concern.

2.2.3. Mannually Remove Redundanct Columns

Character columns with near 100% variance (Every Row is different) are removed as they provide no group level information that can be used on a larger population/scale. These columns include textual columns describing the home, host, amenties etc. For lack of conclusion from other variables, these columns can be revisited for sentiment analysis.

Also, other columns that provide no specific benefit for our profitibility analysis in this case are also removed.

Duplicated Rows Detection

Last step in data cleaning, let’s double check if there is any duplicated rows.

No duplicated rows has been found in the cleaned airbnb dataset.

3. Data Treatment for zillow Dataset

3.1. Data Cleaning – zillow Dataset

Majority of columns in zillow Dataset are numeric input account for the median price of 2-bedroom properties between year 1996 and 2017 and spread monthly.

3.1.1. Missing Values Detection

Median Price for early years (1996-2013) has plenty of Nulls as shown in the Appendix 2. This is also not consistent across all Regionnames. Steps are taken in the next section to filter out columns with high percentage of Nulls/NA.

3.1.2. Negative or Zero Value

## costZeroNeg
##   0 
## 262

There is no negative or zero input in all the non character columns(Int/Numeric) as shown in the output table.

3.2. Data Filtering – Zillow dataset

3.2.1. Scalable Data Munging

The zillow dataset only provide cost information as late as June 2017, while the airbnb dataset is two years ahead. It is obvious that this discrepency between cost and revenue will compromise the integrity of our analysis. So we use Time Series Forecasting (ARIMA in this case) to predict the lastest price.

We incorporated this predicitive model within a function called cleanZillow where the input to the function is the zillow Dataset and the city where we want to buy the property. The various steps performed in the function are as followed:

  1. Select only relevant columns such as RegionName which denotes the zipcode, city, SizeRank, and the predicted cost of the property in July 2019.

  2. Filter out the data to include only those rows which belong to the city provided as a function argument, which in our case is New York City.

  3. Due to shortage of time, we will assume that there is seasonality in the price and also that values depend not only on previous values (Auto Regressive AR) but also on diiferences between previous values (Moving Average MA).

Attention: we applied ARIMA model with the assumpition of having seasonality to predict the cost of the properties in Zipcodes from July 2017 to July 2019. We then attach the price of property in July 2019 (calculated at zipcode level) to a new column named cost.

The clean Zillow Function

We call the above function by passing the available Zillow Data and also the city name as New York

The final clean data contains 4 columns i.e. Zipcode, City, SizeRank and the current price of property in the particular zipcode. There are 25 rows where each row describes unqiue zipcode.

Duplicated Rows Detection

Last step in data cleaning, let’s double check if there is any duplicated rows.

No duplicated rows has been found in the cleaned zillow dataset.

4. Data Join, Inspection, and Feature Engineering

The final dataset containing both revenue and cost data is merged based on common zipcode in both cleaned airbnb and cleaned zillow dataset.

## [1] 1577   21

Let’s have a preview on the merged dataset final.

## Observations: 1,577
## Variables: 21
## $ zipcode                      <fct> 10003, 10003, 10003, 10003, 10003, …
## $ id                           <fct> 31352270, 18201247, 8595349, 428619…
## $ neighbourhood_cleansed       <fct> NoHo, East Village, East Village, E…
## $ neighbourhood_group_cleansed <fct> Manhattan, Manhattan, Manhattan, Ma…
## $ latitude                     <dbl> 40.72811, 40.73218, 40.72756, 40.73…
## $ longitude                    <dbl> -73.99381, -73.98815, -73.98588, -7…
## $ property_type                <fct> Loft, Apartment, Apartment, Apartme…
## $ room_type                    <fct> Entire home/apt, Entire home/apt, E…
## $ price                        <dbl> 525, 285, 280, 199, 100, 999, 250, …
## $ cleaning_fee                 <dbl> 120, 150, 50, 0, 50, 200, 95, 25, 1…
## $ availability_30              <int> 13, 29, 0, 4, 0, 23, 0, 0, 4, 29, 1…
## $ availability_60              <int> 43, 59, 0, 5, 0, 44, 13, 0, 34, 59,…
## $ availability_90              <int> 63, 89, 0, 7, 0, 74, 13, 0, 35, 89,…
## $ availability_365             <int> 63, 364, 0, 232, 0, 164, 13, 0, 197…
## $ number_of_reviews            <int> 5, 0, 2, 8, 9, 11, 20, 11, 27, 0, 0…
## $ first_review                 <fct> 2019-04-15, NA, 2016-01-02, 2016-02…
## $ last_review                  <fct> 2019-06-02, NA, 2016-05-28, 2018-12…
## $ review_scores_rating         <int> 64, NA, 80, 90, 93, 100, 93, 98, 92…
## $ City                         <fct> New York, New York, New York, New Y…
## $ SizeRank                     <int> 21, 21, 21, 21, 21, 21, 21, 21, 21,…
## $ cost                         <dbl> 2214644, 2214644, 2214644, 2214644,…

4.1. Correction on Zipcode 10013 & Its Neighbourhood Group

Upon inspection on the matching of zipcode and neighbourhood group, we find that 10013 has been assigned to both Manhattan and Brooklyn, which should not have happend.

After research, it is clear that 10013 should only belong to Manhattan, thus we make the correction here.

4.2. Correction on Price for Properties listed as Private Room

For those 2 bedrooms listings that offer private rooms, price of the daily rental in Revenue data is reflective of the space that is offered but not the entire property itself. The price must be specifically corrected to account for entire property to account the benefit.

Assumption Made: For those property that rent out the space as private rooms, we assume that both rooms are booked at the same time. Thus, the occupancy rate is same as regular entire apartment. If the property type is Private Room, it is multipled by number of bedrooms to account for overall price. Correction applied is returned to original price column.

4.3. UDF – Normalization

Create a function to normalize certain columns into 0-1 scale based on the min-max value of the column so they can contribute value to the analysis in the following section.

First, we will normalize the number_of_reviews column.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.002481 0.009926 0.049373 0.042184 1.000000

This summary indicates that there are a very small number of popular properties that are getting lots of reviews while the majority of them get only a handful of reviews.

4.4. Detection on Missing Values

List and compare the NA values and percentage of the final dataset before further feature engineering and analysis.

4.5. Feature Engineering

Although there are still some NA values in the final dataset, including review_scores_rating, first_review, last_review, and cleaning_fee, they will not significantly impaire our further analysis. Besides, if necessary in the future, we can consider impute the NA values in the cleaning_fee column, while the other three columns tends to be unimputable. We store those aggregate information in another dataset final_sum.

##     zipcode       price        cleaning_fee    availability_30 
##  10003  : 1   Min.   : 85.0   Min.   : 30.00   Min.   : 2.000  
##  10011  : 1   1st Qu.:147.1   1st Qu.: 41.25   1st Qu.: 4.970  
##  10013  : 1   Median :246.6   Median : 52.50   Median : 6.466  
##  10014  : 1   Mean   :237.0   Mean   : 58.75   Mean   : 9.453  
##  10021  : 1   3rd Qu.:327.6   3rd Qu.: 70.00   3rd Qu.:13.089  
##  10022  : 1   Max.   :412.5   Max.   :100.00   Max.   :29.500  
##  (Other):18                   NA's   :20                       
##  availability_60 availability_90 availability_365 number_of_reviews
##  Min.   : 9.00   Min.   :14.70   Min.   : 63.00   Min.   : 5.014   
##  1st Qu.:13.96   1st Qu.:22.47   1st Qu.: 94.27   1st Qu.:13.695   
##  Median :18.15   Median :28.93   Median :113.50   Median :19.257   
##  Mean   :22.87   Mean   :37.19   Mean   :138.49   Mean   :20.521   
##  3rd Qu.:30.46   3rd Qu.:51.73   3rd Qu.:164.44   3rd Qu.:25.852   
##  Max.   :56.00   Max.   :86.00   Max.   :360.50   Max.   :43.000   
##                                                                    
##   first_review  last_review  review_scores_rating      City    
##  Min.   : NA   Min.   : NA   Min.   : 89.00       Min.   : NA  
##  1st Qu.: NA   1st Qu.: NA   1st Qu.: 91.69       1st Qu.: NA  
##  Median : NA   Median : NA   Median : 92.38       Median : NA  
##  Mean   :NaN   Mean   :NaN   Mean   : 93.90       Mean   :NaN  
##  3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: 96.75       3rd Qu.: NA  
##  Max.   : NA   Max.   : NA   Max.   :100.00       Max.   : NA  
##  NA's   :24    NA's   :24    NA's   :18           NA's   :24   
##     SizeRank           cost           unique_num    
##  Min.   :   1.0   Min.   : 370164   Min.   :  1.00  
##  1st Qu.:  47.0   1st Qu.: 448856   1st Qu.:  7.75  
##  Median : 479.5   Median :1378655   Median : 72.50  
##  Mean   :1056.6   Mean   :1362156   Mean   : 65.71  
##  3rd Qu.:1762.2   3rd Qu.:2059063   3rd Qu.:105.50  
##  Max.   :4647.0   Max.   :3189236   Max.   :189.00  
##                                                     
##  neighbourhood_group_cleansed
##  Bronx        : 0            
##  Brooklyn     : 5            
##  Manhattan    :11            
##  Queens       : 1            
##  Staten Island: 7            
##                              
## 

From the summary above, we know that there are 24 zipcodes in New York City that have rental properties listed on Airbnb. Currently, there are at least 1 such property in any zipcode, while at most 189 such properties per zipcode.

III. Analysis and Visualization

1. Calculation of Three Financial Metrics

Having the business goal of finding the zipcode whose rental properties can generate the most short-term profits, it is necessary to have a metrix to measure the performance and short term potential of those zipcode in NYC. We will look at the revenue from monthly, quarterly, and yearly basis, with the assumption of having an occupancy rate of 75% among all time.

Besides, we will refer to two most important financial ratio in rental property business: the Capitalization Rate and the Rent Rate. In this case, they are calculated as:

\[{Cap\;Rate} = \frac{Annual\;Revenue}{Total\;Investment\;on\;Property}\]

\[{Rent\;Rate} = \frac{Monthly\;Revenue}{Total\;Investment\;on\;Property}\]

2. Visualization and Interpretation

In this section, we will look at the dataset from a general perspective at neighbourhood level and then dive deeper into those zipcodes that have at least 10 properties listed to find the ones that ultimately match our business goal in terms of three metrics: Number of properties, revenue, and Cap Rate. Once top zipcodes are chosed, they will then be cross-examined in three tests on property-type diversity, rental availabilty, and popularity (indicated by the frequency of reviews).

Before we start, let’s create a function that can:
1) sort the result in descending order
2) contains only the top 10 zipcodes where there are at least 10 properties listed on AirBnB.

1) Relationship Among Investment, Revenue, and Cap Rate

We create a bubble plot to see the trend among total investment (x-axis), annual revenue (y-axis) and Cap Rate (size of bubble) as below. Neighbourhood is also indicated by color.

In general, with high cost of property, the annual revenue is also high. However, this fact does not indicate that those properties also have a considerable Cap rate. For example, 10035 has the highest revenue of about 77M and a medium high cost of about 1.79B, it only has a Cap Rate of 4.3% (small bubble); however, 10309, on the other hand, has the highest Cap Rate of 23% but only 1.95M in revenue and 451K in investment. This discovery infers the trend among revenue, investment cost, and cap rate. It is in accordance with the investment reality, but doesn’t explicitly tell us which are the best zipcodes to invest for short-term profit.

3) Properties in each Zipcode VS. Total Properties in NYC

When comparing the properties in each zipcode to the total properties in NYC using both number and percentage.

The top 5 zipcodes with the most number of properties are: 11215, 10036, 10003, 10025, 11217

4) Top zipcodes by Revenue

In this part, we show two charts with one being quarterly revenue and another one being annual revenue. 10036 makes the most revenue with 3.5M per quarter and 14.3M per year. While top 5 zipcode all have more than 2.5M in quarterly revenue and 11M in annual revenue.

The top 5 zipcodes with the most revenue are: 10036, 10003, 10013, 10025, 10011

5) Top Zipcodes with highest Capitalization Rate & Rent Rate

The Capitalization Rate is very important to rental real estate investment because it helps to evaluate real estate based on its current value and its net operating income (in this case, annual revenue). It gives them an initial yield on an investment property. An investor can look at a rising cap rate for a property and see that there’s a rise in income relative to its price. In contrast, a fall in cap rate generally indicates that there is lower rental income compared to its price.

A good cap rate is typically higher than 4 percent which is also approximately the average Cap Rate in NYC.

As it shows in the graph, among those 10 zipcodes that have at least 10 rental properties listed, all of them have a Cap Rate that is above 4% while 11434 has the highest Cap rate of 9.99%.

This trend is the same in terms of Rent Rate, while all top 10 are around 0.4% and the highest being 11434 with 0.82%.

The top 5 zipcodes with the highest Cap Rate & Rent Rate are: 11434, 10305, 10025, 10036, 11217

6) Checkpoints for Top Zipcodes over Three Financial Metrics

In the above analysis, we have gathered three groups of top 5 zipcodes in terms of the number of properties, the amount of revenue, and the Cap Rate. Now we will see how frequent have those zipcodes scored in the top 5 range from those three perspectives.

As the table above shows, 10025 and 10036 are in the top 5 range for all three criterias, while 10003 and 11217 have matched two criterias. 10011, 10013, 10305, 11215, and 11434 appeared only once.

We will choose 10025, 10036, 10003, 11217 as the top zipcodes for now and further inspect them for property diversity, rental availability, and popularity in the following analysis.

3. Other Tests:

After choosing the 4 top zipcodes above, we have three more tests to check their performance on property diversity, rental occupancy, and online popularity.

1) Property Type vs. Investment Cost in difference zipcodes

We should also have an overview on property type to ensure the rental properties listed are diversified so one type of property will not be overly supplied.

As it shows, both Manhattan and Brooklyn have a good diversity on property type, and there is a clear price range difference for Apartment in those neighbourhoods. This fact once again raised the question of making choice on which neighbourhood to invest given that cash resources are usually limited.

Also, all four of our chosen top zipcodes - 10025, 10036, 10003, 11217 - are fairly diversified, while 10003 has 7 types of properties listed.
10025 has 2 types of properties listed (Apartment and Condominium);
10036 has 4 types listed (Apartment, Condominium, House, and Serviced Apartment);
10003 has 7 types listed (Apartment, Condominium, House, Loft, Other, Serviced Apartment, and Townhouse);
11217 has 6 types listed (Apartment, Condominium, Guest Suite, House, Loft, and Townhouse)

2) Rental Availability vs. Rental Price in Different Neighbourhood

In this case, we use availability_365 as the indicator for the rental availability. By looking into the trend between rental availability and the average rental price by zipcode in different neighbourhood, it tells us the possibility of finding a popular property that can expect high occupancy and generate the most profit per stay.

As it demonstrate, properties in Manhattan have highest range of rental price above $270 and is mostly more popular than other neighbourhoods. Properties in Brooklyn are very popluar in general and have a medium range of rental price from $130 to $250. Also, there is a clear price difference for different neighbourhood.

Also, three of our chosen top zipcodes - 10025, 10003, 11217 - have a high occupancy rate while 10036 only has a medium performance.

3) Popularity of Measured by Number of Reviews vs. Number of Weeks

Popularity is measured in terms of number of reviews and it is a proponent of time. We intuitively assume that more reviews and longer listeing time equate to popular properties and popular neighborhood/zipcode.

Top 10 zipcodes from the popular section are

Generally, Brooklyn and Manhattan have been having rental property listed on AirBnb for a longer time, and Brooklyn seems to get even more reviews than Manhattan.

As for the top 4 zipcode we have choosen, 10036 has the best performance because it has a relevantly high number of reviews although its listing time the shortest among those 4 zipcode. See below:

10025 has been listed for more than 74 weeks and have received about 19 reviews
10003 has been listed for more than 96 weeks and have received about 32 reviews
11217 has been listed for more than 97 weeks and have received about 27 reviews
10036 has been listed for more than 66 weeks and have received about 27 reviews

4. Conclusion

Based on the analysis over the number of properties, the amount of revenue, and the Cap Rate, we first gathered three groups of top 5 zipcodes. Then, we chosed 4 zipcodes because of their high performance accross those three metrics. Those 4 zipcodes are: 10003, 10025, 10036 in Manhattan, and 11217 in Brooklyn. 10025 and 10036 ranked top 5 for all three metrics while 10003 and 11217 ranked top 5 for two out of three metrics.

Although all of them have performed well in the following tests on property diversity, rental availability, and online popularity; none of them has a distinguishable performance accross those three tests this time.

A detailed comparison is provided as below:

##                                       X1          X2          X3
## zipcode                            10003       10025       10036
## neighbourhood_group_cleansed   Manhattan   Manhattan   Manhattan
## unique_num                           138         126         149
## price                           325.8913    332.7143    351.2215
## cost                             2214644     1491193     1731224
## Rent_Rate                    0.003310941 0.005020189 0.004564680
## Cap_Rate                      0.04028311  0.06107896  0.05553694
## Revenue_quarter                  3035678     2829735     3532410
## Revenue_semiyear                 6071355     5659470     7064820
## Revenue_year                    12311359    11476148    14325885
## availability_365               106.51449    96.54762   163.36242
## number_of_reviews               23.60145    14.53175    19.28188
##                                       X4
## zipcode                            11217
## neighbourhood_group_cleansed    Brooklyn
## unique_num                           125
## price                           243.8960
## cost                             1322669
## Rent_Rate                    0.004148928
## Cap_Rate                      0.05047863
## Revenue_quarter                  2057872
## Revenue_semiyear                 4115745
## Revenue_year                     8345816
## availability_365                70.85600
## number_of_reviews               22.36800

IV. Final Thoughts

1. Analysis Summary

For this analytics project, the objective is to identify top zipcodes for our real estate clien who wants to invest in 2 bedrooms rental properties in New York city with the goal of generating considerable short-term gains. First, pre-obtained datasets from AirBnb and Zillow were pre-processed, cleaned and aggregated at zipcode level. Then, they were filtered reduce the redundancy and merged together by zipcode to get a single data table containing all necessary information. After that, Exploratory Data Analysis was conducted using interactive graphs and visuals to figure out which zipcodes are high rewarding.

As for the result, we suggest that 10025 and 10036 are the best two zipcodes to invest becuase they both have a substaintial number of properties listed and a high potential of return on investment regarding its revenue and Cap Rate. Moreover, they both have faily good performance in terms of property diversity, rental occupancy, and online popularity. Good performance in those merits and tests indicate that these two zipcodes are more likely to meet our business goal of generating short-term gains. 10003 and 11217 should also be considered carefully because they also have a similar performace and have a recognizable advantage over other zipcodes.

2. Looking Forward

There are also several changes that could be adopted to make our analsis even more thorough in the future. They are:

  1. In the final datasets, some of the NA values can be imputed if necessary.

  2. Square feet data is significantly missing in this case but could be useful when making a real estate investment decision. External data sources can be considered if necessary.

  3. Although this analysis has a business context of making short-term profit, we could still consider longer term profitibility as a reference.

  4. Better prediction on investment cost, and consider the possibility of the property appreciation.

  5. The metrics of rental availability and popularity can be modified to better reflect the performance.

  6. Investment portofolio could be formed based on finding the best combination of different property setting in various zipcodes.

Appendix

1. Missing value summary for the original airbnb dataset

2. Missing value summary for the original zillow dataset